Adding non-alerting data into PagerDuty?

Josiah_Ritchie · May 15, 2020, 2:10pm

I’m evaluating PagerDuty for our org. One thing we’d like is a stream of data that we watch regularly but don’t alert on, easily accessible when looking at an alert. For example, in our mail system, when I get an alert that no mail has gone through a specific server, I’d like to see whether the bounced and deferred email counts on that server have been rising, to direct my response. This would save me bopping my way through various systems gathering data and would speed up my response.

Is this something that can be done (How?) or are we trying to solve a normal problem in an abnormal fashion? (What should I do instead?)

jay2 · May 15, 2020, 2:10pm

Hi Josiah,

I’m not sure I’m capturing your request correctly, so please let me know if I missed something. Could I suggest setting up a way to be notified when the bounce rate for certain email servers increases, instead of waiting for the notification that no email has gone through the server?

Is it possible for you to send an email every time a bounce occurs? If so, then you can use our Threshold Alerts feature to trigger a PagerDuty incident only when your customized alert conditions breach your specified limits.

So in your case, you could set a threshold alert to only trigger a PagerDuty incident if 10 (or any value) alerts have been sent in the past 10 (or any value) minutes.
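To make the "N alerts in M minutes" rule concrete, here is a minimal sketch of the idea as a sliding-window counter. It is only an illustration of the logic, not how PagerDuty implements the feature; the `ThresholdRule` class and its parameters are hypothetical names, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class ThresholdRule:
    """Hypothetical helper: breach when `count` events fall within `window_seconds`."""

    def __init__(self, count, window_seconds):
        self.count = count
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one event (time in seconds); return True if the threshold is breached."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.count


# 10 bounces within 10 minutes (600 seconds) would trigger an incident.
rule = ThresholdRule(count=10, window_seconds=600)
burst = [rule.record(i * 30) for i in range(10)]      # one bounce every 30 s

# One bounce every 2 minutes never puts 10 inside the same window.
quiet_rule = ThresholdRule(count=10, window_seconds=600)
trickle = [quiet_rule.record(i * 120) for i in range(20)]
```

With these inputs, only the tenth event of the burst breaches the rule, while the slow trickle never does.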

If this feature sounds like it can help you, I recommend reading further in our knowledge base article.

Looking forward to hearing your input.

Cheers,

simonfiddaman · May 15, 2020, 2:10pm

Hi @Josiah_Ritchie,

Sounds like what you’re looking for is some kind of logging / metric aggregation (either Logstash + Kibana, or Splunk for logging aggregation and visualisation, or Grafana for metrics visualisation) and alerting based on that data.

You could, for example, run something like statsd or collectd on each mail server instance with a plugin that captures the metrics you’re looking for, and posts to Graphite (or implement Prometheus and make it a pull model). You could even implement the checks (i.e. metrics polls) in Nagios and use Nagios (or Sensu, et al) to generate the alerting. Grafana (over Graphite or some other metrics backend) can also alert you.
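As a rough sketch of the push model described above, the snippet below formats bounced and deferred counts in the plain-text statsd line protocol (`name:value|type`) and sends them over UDP. The metric names, the `mx1` server label, and the counts are made up for illustration; it assumes a statsd daemon listening on its default port 8125, and how you actually obtain the counts from your mail system is left out.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the statsd line protocol: name:value|type."""
    return f"{name}:{value}|{metric_type}"


def mail_queue_lines(server, bounced, deferred):
    """Build gauge lines for one mail server (metric names are hypothetical)."""
    prefix = f"mail.{server}"
    return [
        statsd_line(f"{prefix}.bounced", bounced, "g"),
        statsd_line(f"{prefix}.deferred", deferred, "g"),
    ]


def send_metrics(lines, host="127.0.0.1", port=8125):
    """Send newline-separated metric lines to a statsd daemon over UDP."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


lines = mail_queue_lines("mx1", bounced=12, deferred=3)
# send_metrics(lines)  # uncomment once a statsd daemon is reachable
```

Run from cron or a small loop on each mail server, something like this gives Graphite (and so Grafana) a time series per server to graph and alert on.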

This is how we do it (with a combination of all of the above), preferably at a global service level (i.e. mail inbound), but sometimes also on individual instances (which is noisier).

If you’re looking for something that can aggregate and visualise your metrics but don’t want to have to set it up / host it yourself, look to some of the -aaS providers like Datadog (you can also use them as a PagerDuty alert trigger source).

In the end it’s all about how transparent you can make your own metrics. I like to think of exposing, aggregating and displaying metrics as your monitoring (Datadog, Grafana, Prometheus, etc.); only once thresholds or trend analysis are applied to that data does it become alerting (PagerDuty).

If your PagerDuty Services are specific enough, you can use Extensions -> Add-ons to include your dashboards inline in Services. The dashboard is bound to a Service and visible in Incidents belonging to that Service, and you can’t specify inline display links dynamically with the alert, only statically per Service.
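Since the inline dashboards are static per Service, one possible workaround is to have whatever sends the alert attach a snapshot of the metrics to the event itself. This is a minimal sketch, assuming the alert goes through the PagerDuty Events API v2 and that its `custom_details` and `links` fields work as I remember them (check the current API docs before relying on it). The routing key, server name, metric names and dashboard URL are all placeholders.

```python
import json
import urllib.request

# Assumed Events API v2 endpoint.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key, server, summary, metrics, dashboard_url):
    """Build a trigger event carrying a metrics snapshot and a dashboard link."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": server,
            "severity": "critical",
            # Non-alerting context shown alongside the incident.
            "custom_details": metrics,
        },
        "links": [{"href": dashboard_url, "text": "Mail metrics dashboard"}],
    }


def send_event(event):
    """POST the event as JSON; returns the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


event = build_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder
    server="mx1",
    summary="No mail has gone through mx1 in the last 15 minutes",
    metrics={"bounced_last_hour": 12, "deferred_last_hour": 3},
    dashboard_url="https://grafana.example.com/d/mail-mx1",  # placeholder
)
# send_event(event)  # needs a real integration key
```

The trade-off is that the snapshot is frozen at trigger time, so the dashboard link is still what you would follow for the live trend.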

Hope that helps,
@simonfiddaman